In this project, you will apply unsupervised learning techniques to identify segments of the population that form the core customer base for a mail-order sales company in Germany. These segments can then be used to direct marketing campaigns towards audiences that will have the highest expected rate of returns. The data that you will use has been provided by our partners at Bertelsmann Arvato Analytics, and represents a real-life data science task.
This notebook will help you complete this task by providing a framework within which you will perform your analysis steps. In each step of the project, you will see some text describing the subtask that you will perform, followed by one or more code cells for you to complete your work. Feel free to add additional code and markdown cells as you go along so that you can explore everything in precise chunks. The code cells provided in the base template will outline only the major tasks, and will usually not be enough to cover all of the minor tasks that comprise it.
It should be noted that while there will be precise guidelines on how you should handle certain tasks in the project, there will also be places where an exact specification is not provided. There will be times in the project where you will need to make and justify your own decisions on how to treat the data. These are places where there may not be only one way to handle the data. In real-life tasks, there may be many valid ways to approach an analysis task. One of the most important things you can do is clearly document your approach so that other scientists can understand the decisions you've made.
At the end of most sections, there will be a Markdown cell labeled Discussion. In these cells, you will report your findings for the completed section, as well as document the decisions that you made in your approach to each subtask. Your project will be evaluated not just on the code used to complete the tasks outlined, but also your communication about your observations and conclusions at each stage.
# Make sure any changes to custom packages can be reflected immediately
# in the notebook without kernel restart
%load_ext autoreload
%autoreload 2
# import libraries here; add more as necessary
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly_express as px
import pandas_profiling as pdp
# magic word for producing visualizations in notebook
%matplotlib inline
# Setup some nice defaults for dataframe display
pd.set_option("display.max_columns", 85)
pd.set_option("display.max_rows", 85)
# Given the expected size of our DataFrames, need to disable jedi for speeding up autocompletion
%config Completer.use_jedi = False
There are four files associated with this project (not including this one):
- Udacity_AZDIAS_Subset.csv: Demographics data for the general population of Germany; 891211 persons (rows) x 85 features (columns).
- Udacity_CUSTOMERS_Subset.csv: Demographics data for customers of a mail-order company; 191652 persons (rows) x 85 features (columns).
- Data_Dictionary.md: Detailed information file about the features in the provided datasets.
- AZDIAS_Feature_Summary.csv: Summary of feature attributes for demographics data; 85 features (rows) x 4 columns.

Each row of the demographics files represents a single person, but also includes information outside of individuals, including information about their household, building, and neighborhood. You will use this information to cluster the general population into groups with similar demographic properties. Then, you will see how the people in the customers dataset fit into those created clusters. The hope here is that certain clusters are over-represented in the customers data, as compared to the general population; those over-represented clusters will be assumed to be part of the core userbase. This information can then be used for further applications, such as targeting for a marketing campaign.
To start off with, load in the demographics data for the general population into a pandas DataFrame, and do the same for the feature attributes summary. Note for all of the .csv data files in this project: they're semicolon (;) delimited, so you'll need an additional argument in your read_csv() call to read in the data properly. Also, considering the size of the main dataset, it may take some time for it to load completely.
Once the dataset is loaded, it's recommended that you take a little bit of time just browsing the general structure of the dataset and feature summary file. You'll be getting deep into the innards of the cleaning in the first major step of the project, so gaining some general familiarity can help you get your bearings.
# Load in the general demographics data.
gen_pop = pd.read_csv('data/Udacity_AZDIAS_Subset.csv', sep=';')
gen_pop.head()
# Load in the feature summary file.
feat_info = pd.read_csv('data/AZDIAS_Feature_Summary.csv', sep=';')
feat_info.head()
# Check the structure of the data after it's loaded (e.g. print the number of
# rows and columns, print the first few rows).
gen_pop.info(memory_usage='deep')
The feature summary file contains a summary of properties for each demographics data column. You will use this file to help you make cleaning decisions during this stage of the project. First of all, you should assess the demographics data in terms of missing data. Pay attention to the following points as you perform your analysis, and take notes on what you observe. Make sure that you fill in the Discussion cell with your findings and decisions at the end of each step that has one!
The fourth column of the feature attributes summary (loaded in above as feat_info) documents the codes from the data dictionary that indicate missing or unknown data. While the file encodes this as a list (e.g. [-1,0]), this will get read in as a string object. You'll need to do a little bit of parsing to make use of it to identify and clean the data. Convert data that matches a 'missing' or 'unknown' value code into a numpy NaN value. You might want to see how much data takes on a 'missing' or 'unknown' code, and how much data is naturally missing, as a point of interest.
As one more reminder, you are encouraged to add additional cells to break up your analysis into manageable chunks.
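To make that parsing concrete before diving in, here's a toy illustration (the string below is made up, though it matches the bracketed format described above):

```python
# Toy illustration: the 'missing_or_unknown' column stores lists as
# strings like "[-1,0]" (this value is made up, not from the real file).
raw = "[-1,0]"

# Strip the surrounding brackets, then split on the comma separator.
# Keep the pieces as strings for now -- some features use non-numeric
# codes such as 'X', so casting everything to int blindly would raise.
codes = raw[1:-1].split(',')
print(codes)  # ['-1', '0']
```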
# Look at naturally missing data first
missing = pd.DataFrame(gen_pop.isnull().sum()).rename(columns = {0: 'total missing'})
missing['percent missing'] = round(missing['total missing'] / len(gen_pop),2)
missing.sort_values('total missing', ascending = False, inplace=True)
missing
nonzero_missing_count = len(missing[missing['total missing'] > 0])
print(
    f"{nonzero_missing_count} columns have a nonzero number of naturally missing values")
Oof, looks like the consumer behavior type (KK_KUNDENTYP) is pretty sorely lacking values. Hard to say if it will be the dominant "missing value" feature once we fill in the coded missing values with np.nan, but given that it's more than 60% missing I'll probably need to drop it as a feature. In my experience, features with that many missing values are rarely useful for modeling.
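To make that drop criterion concrete, here's a small helper sketch (the function name, the 60% threshold default, and the demo frame are all illustrative, not part of the project template):

```python
import numpy as np
import pandas as pd

def cols_above_missing_threshold(df, threshold=0.6):
    """Return the columns of df whose fraction of NaN values exceeds threshold."""
    frac_missing = df.isnull().mean()  # per-column fraction of NaNs
    return frac_missing[frac_missing > threshold].index.tolist()

# Tiny demonstration frame: 'b' is 75% missing, 'a' is complete
demo = pd.DataFrame({'a': [1, 2, 3, 4],
                     'b': [np.nan, np.nan, np.nan, 4]})
print(cols_above_missing_threshold(demo))  # ['b']
```

Anything crossing the threshold becomes a candidate for dropping, mirroring the intuition about KK_KUNDENTYP above.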
# Convert the strings for the missing values from the feature summary
# To be proper lists of values to use for filling in NaNs
# First, remove brackets
# Then, split on comma separator
feat_info.loc[:, 'missing_or_unknown'] = \
feat_info.loc[:, 'missing_or_unknown'].str[1:-1].str.split(',')
feat_info.head()
# What columns are object (probably mixed) type?
gen_pop.columns[gen_pop.dtypes == 'object']
# What unique data types exist within the object-type columns?
col_types = {}
for col in gen_pop.columns[gen_pop.dtypes == 'object']:
    col_types[col] = set(type(e) for e in gen_pop[col])
col_types
Good to know! Looks like all object-type columns are of mixed float and str type. This will help us do intelligent typecasting as we parse through the feat_info missing codes below.
A few other notes (ignoring codes used for missing):
- OST_WEST_KZ
- CAMEO_DEUG_2015
- CAMEO_DEU_2015
- CAMEO_INTL_2015

def fill_missing(df, missing_codes_mapping, inplace=False):
    '''
    Parses a dataframe of missing values and their mapping to individual feature names
    and then fills any of those values found in a dataframe's matching feature columns
    with np.nan.

    Inputs
    ------
    df: pandas DataFrame. Table with features that match the ones for which we have
        missing mappings. Each sample is a person.
    missing_codes_mapping: pandas DataFrame. Contains columns 'attribute' and
        'missing_or_unknown' that map codes used for missing/unknown values to
        features/attributes. 'missing_or_unknown' is expected to have elements
        that are lists of str (usually ints, but sometimes chars or empty lists).

    Returns
    -------
    df with NaN values filled in according to missing_codes_mapping
    '''
    # Use deep copy if inplace=False, otherwise use shallow copy
    data = df.copy(deep=not inplace)
    missing_codes = missing_codes_mapping.copy(deep=not inplace)

    def parse_missing_codes(code_list):
        '''
        Goes through a list of str and converts the elements of the list according
        to the needs of the dtypes in our demographic data such that the results
        can be used for filling in NaN values.

        Inputs
        ------
        code_list: list of str. List is expected to contain the chars, floats, or ints
            that are codes indicating a missing or unknown value.

        Returns
        -------
        list or np.nan. Each element of the list returned is typecast according to
            the expected needs of the NaN-filling it will be doing. Empty lists
            (or lists with only an empty string in them) are returned as np.nan.
        '''
        # Make sure list isn't just an empty string
        if '' not in code_list:
            # Check if list can be converted to int without issues - if so, do it
            try:
                return [int(e) for e in code_list]
            # Not all are cast-able to int
            except ValueError:
                return [float(e) if 'X' not in e else e for e in code_list]
        else:
            return np.nan

    # Typecast missing value codes appropriately
    missing_codes.loc[:, 'missing_or_unknown'] = \
        missing_codes.loc[:, 'missing_or_unknown'].apply(parse_missing_codes)

    # Create series that maps feature names (index) to missing codes (data)
    code_map = pd.Series(data=missing_codes['missing_or_unknown'].values,
                         index=missing_codes['attribute'].values)

    # When passing a Series into to_replace, index is key and data is value (like a dict)
    data.replace(to_replace=code_map,
                 value=np.nan,
                 inplace=True)
    return data
# Replace codes we know to mean "missing" or "unknown" with NaN
gen_pop = fill_missing(gen_pop, feat_info)
gen_pop.head()
How much missing data is present in each column? There are a few columns that are outliers in terms of the proportion of values that are missing. You will want to use matplotlib's hist() function to visualize the distribution of missing value counts to find these columns. Identify and document these columns. While some of these columns might have justifications for keeping or re-encoding the data, for this project you should just remove them from the dataframe. (Feel free to make remarks about these outlier columns in the discussion, however!)
For the remaining features, are there any patterns in which columns have, or share, missing data?
# What do the missing data counts look like now, by column?
missing_filled = pd.DataFrame(gen_pop.isnull().sum()).rename(columns = {0: 'total missing'})
missing_filled['percent missing'] = round(missing_filled['total missing'] / len(gen_pop),2)
missing_filled.sort_values('total missing', ascending = False, inplace=True)
missing_filled
Well that's a non-trivial change! The first profile I generated prior to filling in missing values had 6.4% of all values missing, now we're at 10.6%! Additionally:
- Before the conversion, only one feature (KK_KUNDENTYP) had more than 15% of its values missing, and 32 features had no missing values at all. After replacing the coded values with np.nan, we end up with 9 features having more than 15% of their values missing and 25 with no missing values.
- AGER_TYP now has 77% of its values missing. It wasn't even flagged as special prior to the NaN-filling step.
- ALTER_HH went from 8.2% missing to 34.8% missing!

nonzero_missing_count = len(missing_filled[missing_filled['total missing'] > 0])
print(
    f"{nonzero_missing_count} columns have a nonzero number of missing values now")
# Visualize the missing count across columns
fig, (ax1, ax2) = plt.subplots(nrows=2)
# Total counts of missing
sns.distplot(gen_pop.isnull().sum(axis=0), bins=50, ax=ax1)
ax1.set(title='Count of Missing Values in Each Column')
# Percentage missing
sns.distplot(gen_pop.isnull().sum(axis=0)/len(gen_pop), bins=50, ax=ax2)
ax2.set(title='Fraction of Missing Values in Each Column')
plt.tight_layout()
It seems pretty clear here that any features that have more than 200k (20%) missing values are "outliers" in the sense that they are far from the majority of features which are below this threshold. But, that being said, perhaps we should look at each feature first and determine if it might have a lot of relevant information useful for analysis, even if it is substantially lacking values?
missing_filled[missing_filled['total missing'] > 2E5]
Here are the meanings of these six features:

- TITEL_KZ: Academic Title - categorical
- AGER_TYP: Elderly Type - categorical
- KK_KUNDENTYP: Household-level customer type - categorical. Similar information is captured by RETOURTYP_BK_S, so I think we can drop it.
- KBA05_BAUMAX: RR3: Most Common Bldg Type - mixed. Its information overlaps with the other KBA05 features.
- GEBURTSJAHR: Birth Year - numeric. Its age information is also captured by the ALTERSKATEGORIE_GROB feature, so we'll retain it.
- ALTER_HH: Head of Household birthdate (bin) - interval. This is less informative than GEBURTSJAHR, as it only provides the head of household's birth year in 5-year bins. Given that fact, and that it still has significantly more missing values than the rest of the features with fewer than 200,000 missing values, I'm going to drop it.

# Remove the outlier columns from the dataset
outlier_cols = ['ALTER_HH', 'KBA05_BAUMAX', 'KK_KUNDENTYP',
'AGER_TYP', 'TITEL_KZ']
gen_pop.drop(columns=outlier_cols, inplace=True)
# Re-assess after dropping outlier features
missing_noOutliers = pd.DataFrame(gen_pop.isnull().sum()).rename(columns = {0: 'total missing'})
missing_noOutliers['percent missing'] = round(missing_noOutliers['total missing'] / len(gen_pop),2)
missing_noOutliers.sort_values('total missing', ascending = False, inplace=True)
missing_noOutliers
# What does the missing count distribution look like now, by column?
sns.distplot(gen_pop.isnull().sum(axis=0)/len(gen_pop), bins=50)
# First things first: if I'm going to make sense out of these, I need more helpful names
# Read in the names mapping CSV as a dict
new_names = pd.read_csv('col_renaming.csv', header=None, index_col=0,
squeeze=True).to_dict()
gen_pop.rename(columns=new_names, inplace=True)
# Need to stop sorting by amount missing and leave in original sort order
# so I can more easily compare to the data dictionary file
missing_noSort = pd.DataFrame(gen_pop.isnull().sum()).rename(columns = {0: 'total missing'})
missing_noSort['percent missing'] = round(missing_noSort['total missing'] / len(gen_pop),2)
missing_noSort[missing_noSort['total missing'] < 50000]
missing_noSort[missing_noSort['total missing'] > 50000]
Some notes on patterns I noticed during my exploration of the missing data as a function of column/feature:
- Before replacing the coded missing values with np.nan, we had 6.4% of all values reported as missing. Afterwards, 10.6% of all values were missing.
- Most of the features with the fewest missing values (e.g. RR1: Online Affinity and HH: Net Income Bins) are person-level measures (30 of them, in other words). Given that ~50% of all features are person-level, this suggests that person-level features are overrepresented in this group (if the information levels were distributed the same in this sample as they are in the total population, we'd expect only 16-17 features to be person-level).
- It's surprising to see Age Bin with such a relatively low number of missing values. That seems really odd, but it is likely due to the method used for determining the age of an individual: the Age Bin values cover 15-30 year chunks of a person's life, whereas the birth year and head of household birth date features had resolutions of 1 and 5 years, respectively.
- Consumption Channel Type, Vacation Habits, Customer Type, Socioeconomic Status - HighRes, Socioeconomic Status - LowRes, and Online Affinity all have the exact same number of missing values. This suggests to me that they all came from the same data source and likely used the same methodology for their derivation, as this pattern seems a little too consistent to be random.
- It's also odd that the Age Bin feature should be in the lowest missing value count group while the birth date features are in the highest. Even though they have significantly higher resolutions than the Age Bin feature, they must be drawn from very different data sources, since otherwise they shouldn't be so different. Strange.

Now, you'll perform a similar assessment for the rows of the dataset. How much data is missing in each row? As with the columns, you should see some groups of points that have very different numbers of missing values. Divide the data into two subsets: one for data points that are above some threshold for missing values, and a second subset for points below that threshold.
In order to know what to do with the outlier rows, we should see if the distribution of data values on columns that are not missing data (or are missing very little data) are similar or different between the two groups. Select at least five of these columns and compare the distribution of values.
You can use seaborn's countplot() function to create a bar chart of code frequencies and matplotlib's subplot() function to put bar charts for the two subsets side by side.

Depending on what you observe in your comparison, this will have implications for how you approach your conclusions later in the analysis. If the distributions of non-missing features look similar between the data with many missing values and the data with few or no missing values, then we could argue that simply dropping those points from the analysis won't present a major issue. On the other hand, if the data with many missing values looks very different from the data with few or no missing values, then we should make a note of those data as special. We'll revisit these data later on. Either way, you should continue your analysis for now using just the subset of the data with few or no missing values.
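A minimal sketch of that side-by-side comparison (the function name countplot_compare and the df/col/k parameters are placeholders of my own, not from the template):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

def countplot_compare(df, col, k):
    """Side-by-side bar charts of code frequencies for rows with at most k
    vs. more than k missing values."""
    row_missing = df.isnull().sum(axis=1)  # missing count per row
    fig, (ax_low, ax_high) = plt.subplots(ncols=2, figsize=(10, 4))
    sns.countplot(x=df.loc[row_missing <= k, col], ax=ax_low)
    ax_low.set_title(f'<= {k} missing per row')
    sns.countplot(x=df.loc[row_missing > k, col], ax=ax_high)
    ax_high.set_title(f'> {k} missing per row')
    plt.tight_layout()
    return fig
```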
# How many duplicate rows are there?
dups = len(gen_pop) - len(gen_pop.drop_duplicates())
print(f"There are {dups} duplicate rows in the dataset, \
representing {round(dups/len(gen_pop)*100,2)}% of all samples.")
This is a non-trivial fraction of data that are duplicates. However, since we don't have person-level IDs for each record, it's impossible to tell if these are truly duplicates or merely extremely similar people. As such, we'll leave these alone.
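As an aside, the same duplicate count can be computed a touch more directly with duplicated(); a toy demonstration:

```python
import pandas as pd

# Toy frame: the first two rows are identical
demo = pd.DataFrame({'a': [1, 1, 2], 'b': [5, 5, 6]})

# duplicated() marks every repeat of an earlier row; summing counts them
print(demo.duplicated().sum())  # 1
```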
# What do the missing data counts look like now, by row?
rows_oneMissing_filled = len(gen_pop) - len(gen_pop.dropna(how='any'))
rows_allMissing_filled = len(gen_pop) - len(gen_pop.dropna(how='all'))
print(f"There are {rows_oneMissing_filled} samples with at least one feature missing a value \
\nand {rows_allMissing_filled} samples missing values for all features")
# How much data is missing in each row of the dataset?
fig, ax = plt.subplots()
fig.suptitle('Fraction of Different Amounts of Missing Values Across Rows')
sns.distplot(gen_pop.isnull().sum(axis=1), kde=False, norm_hist=True, ax=ax)
ax.set(xlabel='Number of Columns with Missing Values')
# Zoom in
fig, ax = plt.subplots()
fig.suptitle('Fraction of Different Amounts of Missing Values Across Rows')
sns.distplot(gen_pop.isnull().sum(axis=1), kde=False, norm_hist=True, ax=ax)
ax.set(xlabel='Number of Columns with Missing Values', xlim=(0,12))
# Provide a rectangle between y=0.025 and y=0.05 to aid the eye in estimation
ax.axhspan(0.025,0.05, color='red', alpha=0.25)
Well, it looks like a pretty safe bet to say that we should subset the data into rows with up to 9 missing values and rows with 10+ missing values, since the former seems to comprise around 85% of the data.
missing_values_breakpoint = 9
# Divide the data into two subsets based on the number of missing
# values in each row.
gen_pop_lowMissing = gen_pop.loc[gen_pop.isnull().sum(axis=1) <= missing_values_breakpoint, :]
#gen_pop_highMissing = gen_pop.loc[gen_pop.isnull().sum(axis=1) > missing_values_breakpoint, :]
# Compare the distribution of values for at least five columns where there are
# no or few missing values, between the two subsets.
def dist_compare(df, missing_count, column):
    '''
    Compares the distribution of values for a single feature between two subsets
    of a DataFrame, subsetting based upon the number of missing values allowed
    per row (e.g. one distribution is only of rows with at most X missing values,
    the other is only of rows with more than X missing values).

    Inputs
    ------
    df: pandas DataFrame in long format
    missing_count: int. Number of missing values to use as breakpoint for subsetting df
    column: str. Column name of the feature from the DataFrame that you're investigating.

    Returns
    -------
    Matplotlib figure comparing the two distributions.
    '''
    fig, (ax_lowMissing, ax_highMissing) = plt.subplots(ncols=2, figsize=(20, 5))
    fig.suptitle(f'Distributions of Mostly Complete Features - {column}'
                 f'\nMissing Values Breakpoint = {missing_count}',
                 fontsize=15, y=1.2)
    sns.distplot(df.loc[df.isnull().sum(axis=1) <= missing_count, column].dropna(),
                 ax=ax_lowMissing, kde=False, norm_hist=True)
    ax_lowMissing.set(title='Rows with Fewer \nMissing Values')
    sns.distplot(df.loc[df.isnull().sum(axis=1) > missing_count, column].dropna(),
                 ax=ax_highMissing, kde=False, norm_hist=True)
    ax_highMissing.set(title=f'Rows with {missing_count + 1}+ \nMissing Values')
    plt.tight_layout()
# The features that are completely or almost completely filled
# and likely to show more than just two values in the distribution
full_cols = ['Energy Consumption Type', 'MoneyType__Primary',
'Age Bin', 'Customer Type', 'RR1: Online Affinity',
'Vacation Habits']
dist_compare(gen_pop, missing_values_breakpoint, 'Energy Consumption Type')
It's clear from these plots of Energy Consumption Type that, while both subsets seem to have a majority of people with an energy consumption pattern of 3 ("fair supplied"), this majority is a larger fraction of the total subset for the data with a higher missing value count.
dist_compare(gen_pop, missing_values_breakpoint, 'MoneyType__Primary')
Much like with Energy Consumption Type, there seems to be a dominant feature value for MoneyType__Primary when you look at the subset with a high missing value count. In this case, it's code 4: "be prepared". There's also a clearly dominant code for the low missing values data: code 6 ("inconspicuous").
dist_compare(gen_pop, missing_values_breakpoint, 'Age Bin')
This one is a little more subtle than the preceding two. For the most part, the two subset distributions for Age Bin look the same, with the only notable difference being that code 4 (> 60 years old) is a smaller fraction of the overall distribution for the high missing values subset than it is for the low missing values one.
dist_compare(gen_pop, missing_values_breakpoint, 'Customer Type')
As with Age Bin, Customer Type has a more subtle distributional difference between the two subsets. That said, it looks like it would be fair to say that code 5 ("determined Minimal-Returner") is the dominant feature for the low missing values subset and code 3 ("incentive-receptive Normal-Returner") for the other subset.
dist_compare(gen_pop, missing_values_breakpoint, 'RR1: Online Affinity')
For RR1: Online Affinity, there can be no doubt: code 2 ("middle") is the dominant feature for the high missing values subset, and the codes are pretty uniformly distributed for the low missing values subset.
dist_compare(gen_pop, missing_values_breakpoint, 'Vacation Habits')
For Vacation Habits, the pattern is clear: code 5 (nature fans) is the dominant feature for the high missing values subset and the distribution is pretty uniform across codes (although with a dip for codes 6 and 7 and an interesting overall positive slope) for the low missing values subset.
Ultimately, it seems like the distributions are more uniform for the rows with low missing value counts than they are for the rows with high missing value counts. As such, it's important to note that the distribution, at least of these features we spot-checked, is not so similar between the two subsets that we can easily say we've lost no information by excluding the rows with higher amounts of missing values.
Checking for missing data isn't the only way in which you can prepare a dataset for analysis. Since the unsupervised learning techniques to be used will only work on data that is encoded numerically, you need to make a few encoding changes or additional assumptions to be able to make progress. In addition, while almost all of the values in the dataset are encoded using numbers, not all of them represent numeric values. Check the third column of the feature summary (feat_info) for a summary of types of measurement.
In the first two parts of this sub-step, you will perform an investigation of the categorical and mixed-type features and make a decision on each of them, whether you will keep, drop, or re-encode each. Then, in the last part, you will create a new data frame with only the selected and engineered columns.
Data wrangling is often the trickiest part of the data analysis process, and there's a lot of it to be done here. But stick with it: once you're done with this step, you'll be ready to get to the machine learning parts of the project!
# Add new attribute names to feat_info
feat_info['attribute_renamed'] = pd.read_csv('col_renaming.csv', header=None)[1]
feat_info
# How many features are there of each data type?
# I'm only looking at the features that still remain in gen_pop after dropping outlier features
feat_info_remaining = feat_info.loc[feat_info['attribute_renamed'].isin(gen_pop.columns), :]
feat_info_remaining['type'].value_counts()
For categorical data, you would ordinarily need to encode the levels as dummy variables. Depending on the number of categories, perform one of the following:

- Binary (two-level) categoricals that take numeric values can be kept without further changes.
- Binary features with non-numeric values need to be re-encoded as numbers or as dummy variables.
- Multi-level categoricals (three or more values) can be encoded with multiple dummy variables, or dropped from the analysis entirely.
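As a toy illustration of the multi-level case (the column name shopper_type is made up), pd.get_dummies with drop_first and dummy_na behaves like so:

```python
import numpy as np
import pandas as pd

# Toy multi-level categorical with a missing value (column name is invented)
df = pd.DataFrame({'shopper_type': [1.0, 3.0, np.nan, 2.0]})

# One indicator column per level, plus one for NaN; drop the first level
# so the remaining indicator columns aren't linearly dependent
dummies = pd.get_dummies(df, columns=['shopper_type'],
                         prefix_sep='__', drop_first=True, dummy_na=True)
print(dummies.columns.tolist())
```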
# Assess categorical variables: which are binary, which are multi-level, and
# which one needs to be re-encoded?
feat_info_remaining[feat_info_remaining['type'] == 'categorical']
# Figure out what different values are available for each feature
cols = feat_info_remaining.loc[feat_info_remaining['type'] == 'categorical', 'attribute_renamed']
unique_cat_vals = pd.DataFrame()
for col in cols:
    temp = pd.DataFrame({'attribute': col, 'unique values': gen_pop[col].unique()})
    unique_cat_vals = pd.concat([unique_cat_vals, temp])
unique_cat_vals.groupby(['attribute']).count().sort_values('unique values')
# What are the values taken on by the binary features?
binary_cols = unique_cat_vals.groupby(['attribute']).count().sort_values('unique values').index[:5]
unique_cat_vals[unique_cat_vals['attribute'].isin(binary_cols)]
The following features appear to be binary in nature:

- Gender
- Young Environmentalist
- Bldg: Location Relative to E or W Germany
- Small or Home Office Owner: has NaN values, which means I need to use dummy variables with it too (so NaN can be reflected by 0,0)
- Insurance Type: has NaN values, so I'll need to use a dummy with it too!

That leaves the following features for re-encoding through dummies:

- Nationality Based on Name
- Shopper Type
- Socioeconomic Status - LowRes
- Family Type - LowRes
- MoneyType__Primary
- Energy Consumption Type
- Consumption Channel Type
- Bldg: Building Type
- RR4: Life Stage Type - LowRes
- Socioeconomic Status - HighRes
- Family Type - HighRes
- Vacation Habits
- RR4: Life Stage Type - HighRes

# Re-encode categorical variable(s) to be kept in the analysis.
cat_cols = ['Bldg: Location Relative to E or W Germany',
'Small or Home Office Owner', 'Insurance Type',
'Nationality Based on Name', 'Shopper Type',
'Socioeconomic Status - LowRes', 'Family Type - LowRes',
'MoneyType__Primary', 'Energy Consumption Type',
'Consumption Channel Type', 'Bldg: Building Type',
'RR4: Life Stage Type - LowRes', 'Socioeconomic Status - HighRes',
'Family Type - HighRes', 'Vacation Habits',
'RR4: Life Stage Type - HighRes']
# Have to drop the first dummy to avoid the Dummy Variable Trap
# Have to include a NaN dummy so the zero-vector can't be interpreted ambiguously
# as either NaN or the first dropped dummy (which are likely different)
gen_pop_lowMissing = pd.get_dummies(
    gen_pop_lowMissing, prefix_sep='__',
    drop_first=True, dummy_na=True,
    columns=cat_cols)
gen_pop_lowMissing.columns
# What does our dataframe look like now?
gen_pop_lowMissing.info()
gen_pop_lowMissing.dtypes.value_counts()
# What's the remaining object dtype column?
gen_pop_lowMissing.columns[gen_pop_lowMissing.dtypes == 'object']
# What are the values in this object-type column?
gen_pop_lowMissing["RR4: Life Stage Type - Int'l Code Mapping"].value_counts()
Since these are all integers, except for the np.nan values, we'll typecast this to int once we've gotten rid of all of the missing values, either by imputing or dropping them as the situation warrants.
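One way to do that typecast later without losing the NaNs in the meantime (a sketch, not the prescribed approach; the code values below are illustrative) is pandas' nullable integer dtype:

```python
import numpy as np
import pandas as pd

# Toy column like the int'l code mapping: string codes plus a NaN
s = pd.Series(['11', '24', np.nan, '43'])

# to_numeric coerces the strings to numbers; 'Int64' (capital I) is the
# nullable integer dtype, so the missing value survives without forcing float
codes = pd.to_numeric(s).astype('Int64')
print(codes.dtype)
```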
I treated these features as binary and not requiring further engineering:

- Gender
- Young Environmentalist

I treated these features as binary but requiring dummy variables due to their encoding or due to the presence of NaN values:

- Bldg: Location Relative to E or W Germany
- Small or Home Office Owner
- Insurance Type

And I used dummies to encode these multi-level categoricals:

- Nationality Based on Name
- Shopper Type
- Socioeconomic Status - LowRes
- Family Type - LowRes
- MoneyType__Primary
- Energy Consumption Type
- Consumption Channel Type
- Bldg: Building Type
- RR4: Life Stage Type - LowRes
- Socioeconomic Status - HighRes
- Family Type - HighRes
- Vacation Habits
- RR4: Life Stage Type - HighRes

There are a handful of features that are marked as "mixed" in the feature summary that require special treatment in order to be included in the analysis. There are two in particular that deserve attention; the handling of the rest is up to your own choices:
Be sure to check Data_Dictionary.md for the details needed to finish these tasks.
Mapping for Generation Designation Re-Encoding:
| Original Code | Original Code Meaning | Decade Code | Movement Code |
|---|---|---|---|
| 1 | 40s - war years (Mainstream, E+W) | 0 | 0 |
| 2 | 40s - reconstruction years (Avantgarde, E+W) | 0 | 1 |
| 3 | 50s - economic miracle (Mainstream, E+W) | 1 | 0 |
| 4 | 50s - milk bar / Individualisation (Avantgarde, E+W) | 1 | 1 |
| 5 | 60s - economic miracle (Mainstream, E+W) | 2 | 0 |
| 6 | 60s - generation 68 / student protestors (Avantgarde, W) | 2 | 1 |
| 7 | 60s - opponents to the building of the Wall (Avantgarde, E) | 2 | 1 |
| 8 | 70s - family orientation (Mainstream, E+W) | 3 | 0 |
| 9 | 70s - peace movement (Avantgarde, E+W) | 3 | 1 |
| 10 | 80s - Generation Golf (Mainstream, W) | 4 | 0 |
| 11 | 80s - ecological awareness (Avantgarde, W) | 4 | 1 |
| 12 | 80s - FDJ / communist party youth organisation (Mainstream, E) | 4 | 0 |
| 13 | 80s - Swords into ploughshares (Avantgarde, E) | 4 | 1 |
| 14 | 90s - digital media kids (Mainstream, E+W) | 5 | 0 |
| 15 | 90s - ecological awareness (Avantgarde, E+W) | 5 | 1 |
def data_mapper(series, mapping_dict):
    '''
    Reads in a pandas Series object that represents the Generation Designation feature
    and returns a re-encoded series according to mapping_dict.

    Inputs
    ------
    series: pandas Series of integer codes (1 through 15) representing different
        Generation Designation values
    mapping_dict: dict of form {designation_code: new_code} used to determine what
        values to return

    Returns
    -------
    pandas Series with the new codes
    '''
    # Since NaN values aren't always propagated as expected, do a quick check
    print(f"There are {series.isnull().sum()} null values in the series "
          f"{series.name} prior to extraction")
    out = series.map(mapping_dict, na_action='ignore')
    print(f"There are {out.isnull().sum()} null values in the series "
          f"{series.name} after extraction")
    return out
# Grouped by decade: 40s, 50s, 60s, 70s, 80s, 90s
decade_code_map = {
    1: 0, 2: 0,
    3: 1, 4: 1,
    5: 2, 6: 2, 7: 2,
    8: 3, 9: 3,
    10: 4, 11: 4, 12: 4, 13: 4,
    14: 5, 15: 5
}
# Investigate "PRAEGENDE_JUGENDJAHRE" and engineer two new variables.
# Variable 1: generation by decade
gen_pop_lowMissing['Generation Decade'] = \
data_mapper(gen_pop_lowMissing['Generation Designation'],
decade_code_map)
# Does it seem like this worked?
# Should be min of 0 and max of 5
gen_pop_lowMissing['Generation Decade'].describe()
movement_code_map = {
1: 0,
2: 1,
3: 0,
4: 1,
5: 0,
6: 1,
7: 1,
8: 0,
9: 1,
10: 0,
11: 1,
12: 0,
13: 1,
14: 0,
15: 1
}
# Feature: "PRAEGENDE_JUGENDJAHRE"
# Variable 2: movement type
gen_pop_lowMissing['Generation Movement'] = \
data_mapper(gen_pop_lowMissing['Generation Designation'],
movement_code_map)
# Does it seem like this worked?
# Should be min of 0 and max of 1
gen_pop_lowMissing['Generation Movement'].describe()
# Drop Generation Designation feature now that we've captured its information in the new features
gen_pop_lowMissing.drop(columns='Generation Designation', inplace=True)
Mapping for RR4: Life Stage Type - Int'l Code Mapping Re-Encoding:
| Original Code | Original Code Meaning | Wealth Code | Life Stage Code |
|---|---|---|---|
| 11 | Wealthy Households - Pre-Family Couples & Singles | 1 | 1 |
| 12 | Wealthy Households - Young Couples With Children | 1 | 2 |
| 13 | Wealthy Households - Families With School Age Children | 1 | 3 |
| 14 | Wealthy Households - Older Families & Mature Couples | 1 | 4 |
| 15 | Wealthy Households - Elders In Retirement | 1 | 5 |
| 21 | Prosperous Households - Pre-Family Couples & Singles | 2 | 1 |
| 22 | Prosperous Households - Young Couples With Children | 2 | 2 |
| 23 | Prosperous Households - Families With School Age Children | 2 | 3 |
| 24 | Prosperous Households - Older Families & Mature Couples | 2 | 4 |
| 25 | Prosperous Households - Elders In Retirement | 2 | 5 |
| 31 | Comfortable Households - Pre-Family Couples & Singles | 3 | 1 |
| 32 | Comfortable Households - Young Couples With Children | 3 | 2 |
| 33 | Comfortable Households - Families With School Age Children | 3 | 3 |
| 34 | Comfortable Households - Older Families & Mature Couples | 3 | 4 |
| 35 | Comfortable Households - Elders In Retirement | 3 | 5 |
| 41 | Less Affluent Households - Pre-Family Couples & Singles | 4 | 1 |
| 42 | Less Affluent Households - Young Couples With Children | 4 | 2 |
| 43 | Less Affluent Households - Families With School Age Children | 4 | 3 |
| 44 | Less Affluent Households - Older Families & Mature Couples | 4 | 4 |
| 45 | Less Affluent Households - Elders In Retirement | 4 | 5 |
| 51 | Poorer Households - Pre-Family Couples & Singles | 5 | 1 |
| 52 | Poorer Households - Young Couples With Children | 5 | 2 |
| 53 | Poorer Households - Families With School Age Children | 5 | 3 |
| 54 | Poorer Households - Older Families & Mature Couples | 5 | 4 |
| 55 | Poorer Households - Elders In Retirement | 5 | 5 |
# At this point, we need to deal with the NaN values in this feature
# Let's take a look at the distribution of values
sns.countplot(x="RR4: Life Stage Type - Int\'l Code Mapping",
data=gen_pop_lowMissing,
order=pd.Series(gen_pop_lowMissing["RR4: Life Stage Type - Int\'l Code Mapping"].unique()).sort_values())
# What fraction of this feature is missing values?
gen_pop_lowMissing["RR4: Life Stage Type - Int\'l Code Mapping"].isnull().sum() / len(gen_pop_lowMissing)
Hmmm...I don't see an obvious pattern here for how I'd impute this feature. Since it's categorical in its current, pre-engineering form, I could impute with the mode (51), but I'd prefer to simply drop the affected rows: the mode isn't clearly dominant among the present values, and only 0.4% of all rows would be lost anyhow.
# Drop the rows with missing int'l codes
gen_pop_lowMissing.dropna(subset=["RR4: Life Stage Type - Int\'l Code Mapping"],
inplace=True)
# Investigate "CAMEO_INTL_2015" and engineer two new variables.
# "CAMEO_INTL_2015" combines information on two axes: wealth and life stage.
# Break up the two-digit codes by their 'tens'-place and 'ones'-place digits into two new
# ordinal variables (which, for the purposes of this project, is equivalent to just treating
# them as their raw numeric values).
# Cast as int since it seems to keep coming up as object type
gen_pop_lowMissing["RR4: Life Stage Type - Int'l Code Mapping"] =\
gen_pop_lowMissing["RR4: Life Stage Type - Int'l Code Mapping"].astype(int)
# Extract the tens digit - wealth code
gen_pop_lowMissing["RR4: Life Stage Type - Int'l - Wealth"] = \
(gen_pop_lowMissing["RR4: Life Stage Type - Int'l Code Mapping"] / 10).astype(int)
# Extract the ones digit - life stage code
gen_pop_lowMissing["RR4: Life Stage Type - Int'l - Stage"] = \
gen_pop_lowMissing["RR4: Life Stage Type - Int'l Code Mapping"] % 10
# Should be between 1 and 5
gen_pop_lowMissing["RR4: Life Stage Type - Int'l - Wealth"].describe()
# Should be between 1 and 5
gen_pop_lowMissing["RR4: Life Stage Type - Int'l - Stage"].describe()
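The digit-splitting logic can be sanity-checked on standalone toy codes (illustrative values, not drawn from the dataset):

```python
import pandas as pd

codes = pd.Series([11, 25, 43, 55])
wealth = codes // 10  # tens digit: wealth level
stage = codes % 10    # ones digit: life stage
assert wealth.tolist() == [1, 2, 4, 5]
assert stage.tolist() == [1, 5, 3, 5]
```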
# Drop lingering RR4: Life Stage Type - Int'l Code Mapping feature
gen_pop_lowMissing.drop(columns="RR4: Life Stage Type - Int'l Code Mapping", inplace=True)
# What other features do I need to look at?
# Ignoring the two features I just worked on
feat_info_remaining[feat_info_remaining['type'] == 'mixed'].drop(labels=[22,59])
As you're typically better off when modeling with more features rather than fewer (at least prior to feature selection and dimensionality reduction), especially if there is somewhat duplicative data within an original feature, I'm going to opt to split these features up whenever possible.
Here are my thoughts:
- Life Stage - HighRes and - LowRes: lots of info, but inconsistent levels. Unlike the split we performed on the Generation Designation feature (East vs. West, if you recall), we couldn't do that here since some values have a completely different set of levels than others do.
- Bldg: Neighborhood Quality: non-obvious ordering of values. It seems at first glance to be clearly ordinal, until you realize that "rural" and "new building rural" aren't inherently less than a "very poor neighborhood".
- PLZ8: Most Common Bldg Type: mixed ordering. The PLZ8 features essentially capture the scale of the majority building type if it is a certain size of multi-family home. The only new information provided in this feature is whether business buildings predominate, so we'll simply make it a binary feature indicating if business buildings are the primary property type or not.
# First, the easy ones: dummies for Life Stage - HighRes and Life Stage - LowRes
cat_cols = ['Life Stage - HighRes',
'Life Stage - LowRes']
gen_pop_lowMissing = pd.get_dummies(
gen_pop_lowMissing, prefix_sep='__',
drop_first=True, dummy_na=True,
columns=cat_cols)
gen_pop_lowMissing.info()
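To see why the NaN dummy plus dropped first level avoids ambiguity, here's a minimal sketch on a toy column (the column name `stage` is hypothetical): an all-zero row can only mean the dropped first level, never a missing value, because NaN gets its own column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'stage': ['a', 'b', np.nan]})
out = pd.get_dummies(df, columns=['stage'], prefix_sep='__',
                     drop_first=True, dummy_na=True)
assert 'stage__a' not in out.columns  # first level dropped
assert 'stage__nan' in out.columns    # NaN level kept as its own dummy
```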
Mapping for Bldg: Neighborhood Quality Re-Encoding:
| Original Code | Original Code Meaning | Neighborhood Quality Code | Rural Code |
|---|---|---|---|
| 0 | no score calculated | 0 | np.nan |
| 1 | very good neighborhood | 1 | 0 |
| 2 | good neighborhood | 2 | 0 |
| 3 | average neighborhood | 3 | 0 |
| 4 | poor neighborhood | 4 | 0 |
| 5 | very poor neighborhood | 5 | 0 |
| 7 | rural neighborhood | 0 | 1 |
| 8 | new building in rural neighborhood | 0 | 2 |
# Map the rural codes to a new feature
rural_code_map = {
0: np.nan,
1: 0,
2: 0,
3: 0,
4: 0,
5: 0,
7: 1,
8: 2
}
gen_pop_lowMissing['Bldg: Rural Type'] = \
data_mapper(gen_pop_lowMissing['Bldg: Neighborhood Quality'],
rural_code_map)
# Make dummies out of the multi-level Rural Type categorical
cat_cols = ['Bldg: Rural Type']
gen_pop_lowMissing = pd.get_dummies(gen_pop_lowMissing,
prefix_sep='__',
drop_first=True,
dummy_na=True,
columns=cat_cols)
# Make the rural codes effectively unscored
neighborhood_code_map = {
0: 0,
1: 1,
2: 2,
3: 3,
4: 4,
5: 5,
7: 0,
8: 0
}
gen_pop_lowMissing['Bldg: Neighborhood Quality'] = \
data_mapper(gen_pop_lowMissing['Bldg: Neighborhood Quality'],
neighborhood_code_map)
Mapping for PLZ8: Most Common Bldg Type Re-Encoding:
| Original Code | Original Code Meaning | Business Bldg Code |
|---|---|---|
| 1 | mainly 1-2 family homes | 0 |
| 2 | mainly 3-5 family homes | 0 |
| 3 | mainly 6-10 family homes | 0 |
| 4 | mainly 10+ family homes | 0 |
| 5 | mainly business buildings | 1 |
biz_code_map = {
1: 0,
2: 0,
3: 0,
4: 0,
5: 1
}
gen_pop_lowMissing['PLZ8: Primarily Business Bldgs'] = \
data_mapper(gen_pop_lowMissing['PLZ8: Most Common Bldg Type'],
biz_code_map)
# Drop original feature, it's no longer useful
gen_pop_lowMissing.drop(columns=['PLZ8: Most Common Bldg Type'],
inplace=True)
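A note on the design choice here: mapping with `na_action='ignore'` (rather than, say, an equality comparison) is what preserves missingness in the new binary feature. A minimal sketch with toy values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 5, np.nan])
eq = (s == 5).astype(int)  # NaN silently becomes 0 here
mapped = s.map({1: 0, 2: 0, 3: 0, 4: 0, 5: 1}, na_action='ignore')
assert eq.iloc[2] == 0           # the comparison lost the missingness
assert pd.isna(mapped.iloc[2])   # map kept it as NaN
```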
Summarizing all of my earlier notes on the mixed-type features:
- Generation Designation (PRAEGENDE_JUGENDJAHRE): split into a feature for the decade of that person's generation and one for the movement they were associated with from that decade (avantgarde or mainstream).
- RR4: Life Stage Type - Int'l Code Mapping (CAMEO_INTL_2015): I split the feature into two new features that extract information about a person's wealth level and their current life stage.
- Life Stage - HighRes and Life Stage - LowRes: treated as categorical variables and converted into dummies.
- Bldg: Neighborhood Quality: I made the assumption, based on the levels of this variable, that any codes not mentioning rural buildings reflect areas that are not primarily rural. As such, I extracted a new multi-level categorical feature that indicates whether a person's building is in a rural neighborhood, is a new building in a rural neighborhood, or is not in a rural neighborhood at all, and then made dummies out of this new feature. I otherwise left the feature as it was, since the neighborhood quality score (presumably for non-rural neighborhoods only) is of interval type (or at least ordinal).
- PLZ8: Most Common Bldg Type: I investigated the feature's allowed levels and determined that it was mostly redundant with the other PLZ8 features we already have. That said, it shouldn't be dropped outright, as it contains information about how prominent business buildings are in the PLZ8 region, something I didn't see reflected in the other features. So I stripped that information out into a binary feature of its own and dropped the rest.
In order to finish this step up, you need to make sure that your data frame now only has the columns that you want to keep. To summarize, the dataframe should consist of the following:
- All numeric, interval, and ordinal type columns from the original dataset
- Binary categorical features (numerically re-encoded)
- Engineered features from other multi-level categorical features and mixed features
Make sure that for any new columns that you have engineered, that you've excluded the original columns from the final dataset. Otherwise, their values will interfere with the analysis later on the project. For example, you should not keep "PRAEGENDE_JUGENDJAHRE", since its values won't be useful for the algorithm: only the values derived from it in the engineered features you created should be retained. As a reminder, your data should only be from the subset with few or no missing values.
# Do whatever you need to in order to ensure that the dataframe only contains
# the columns that should be passed to the algorithm functions.
gen_pop_lowMissing.columns.values
The primary thing I was worried about was remembering to drop features that we had extracted information out of. I don't want a bunch of redundant information lying about during modeling. Looking through the column names, it appears I've managed to avoid this, so I'm satisfied!
Even though you've finished cleaning up the general population demographics data, it's important to look ahead to the future and realize that you'll need to perform the same cleaning steps on the customer demographics data. In this substep, complete the function below to execute the main feature selection, encoding, and re-engineering steps you performed above. Then, when it comes to looking at the customer data in Step 3, you can just run this function on that DataFrame to get the trimmed dataset in a single step.
def clean_data(df, feature_summary, missing_breakpoint):
"""
Perform feature trimming, re-encoding, and engineering for demographics
data
INPUTS
------
df: Demographics DataFrame
feature_summary: DataFrame that includes listing of the various missing value codes
for each feature
missing_breakpoint: int. Amount of missing values allowed in any given
row in order to retain it
OUTPUT: Trimmed and cleaned demographics DataFrame
"""
# -------------------------------------------------------------------
# convert missing value codes into NaNs, ...
# Convert the strings for the missing values from the feature summary
# To be proper lists of values to use for filling in NaNs
# First, remove brackets
# Then, split on comma separator
feature_summary.loc[:, 'missing_or_unknown'] = \
feature_summary.loc[:, 'missing_or_unknown'].str[1:-1].str.split(',')
def fill_missing(df, missing_codes_mapping, inplace=False):
'''
Parses dataframe of missing values and their mapping to individual feature names
and then fills any of those values found in a dataframe's matching feature columns
with np.nan.
Inputs
------
df: pandas DataFrame. Table with features that match the ones for which we have
missing mappings. Each sample is a person.
missing_codes_mapping: pandas DataFrame. Contains columns 'attribute' and
'missing_or_unknown' that map codes used for missing/unknown values to
features/attributes. 'missing_or_unknown' is expected to have elements
that are lists of str (usually ints, but sometimes chars or empty lists).
Returns
-------
df with NaN values filled in according to missing_codes_mapping
'''
# Use deep copy if inplace = False, otherwise use actual inputs
if inplace:
data = df
missing_codes = missing_codes_mapping
else:
data = df.copy(deep=True)
missing_codes = missing_codes_mapping.copy(deep=True)
def parse_missing_codes(code_list):
'''
Goes through a list of str and converts the elements of the list according to the needs
of the dtypes in our demographic data such that the results can be used for
filling in NaN values.
Inputs
------
code_list: list of str. List is expected to contain the chars, floats, or ints
that are codes indicating a missing or unknown value.
Returns
-------
list or np.nan. Each element of the list returned is typecast according to
the expected needs of the NaN-filling it will be doing. Empty lists
(or lists with only an empty string in them) are returned as np.nan.
'''
# Make sure list isn't just empty string
if '' not in code_list:
# Check if list can be converted to int without issues - if so, do it
try:
return [int(e) for e in code_list]
# Not all are cast-able to int
except ValueError:
return [float(e) if 'X' not in e else e for e in code_list]
else:
return np.nan
# Typecast missing value codes appropriately
missing_codes.loc[:, 'missing_or_unknown'] = \
missing_codes.loc[:, 'missing_or_unknown'].apply(parse_missing_codes)
# Create series that maps feature names (index) to missing codes (data)
code_map = pd.Series(data=missing_codes['missing_or_unknown'].values,
index=missing_codes['attribute'].values)
# When passing a Series into to_replace, index is key and data is value (like a dict)
data.replace(to_replace=code_map,
value=np.nan,
inplace=True)
return data
df = fill_missing(df, feature_summary)
# ------------------------------------------------------------------
# remove selected columns and rows, ...
# Removing outlier features, except the one that provides a birth year
df.drop(columns=['ALTER_HH', 'KBA05_BAUMAX', 'KK_KUNDENTYP',
'AGER_TYP', 'TITEL_KZ'], inplace = True)
# If I'm going to make sense out of these, I need more helpful names
# Read in the names mapping CSV as a dict
new_names = pd.read_csv('col_renaming.csv', header=None, index_col=0,
squeeze=True).to_dict()
df.rename(columns=new_names, inplace=True)
# Remove rows having more than one missing value
df = df.loc[df.isnull().sum(axis=1) <= missing_breakpoint,:]
# -------------------------------------------------------------------
# select, re-encode, and engineer column values.
# Re-encode categorical variable(s) to be kept in the analysis.
cat_cols = ['Bldg: Location Relative to E or W Germany',
'Small or Home Office Owner', 'Insurance Type',
'Nationality Based on Name', 'Shopper Type',
'Socioeconomic Status - LowRes', 'Family Type - LowRes',
'MoneyType__Primary', 'Energy Consumption Type',
'Consumption Channel Type', 'Bldg: Building Type',
'RR4: Life Stage Type - LowRes', 'Socioeconomic Status - HighRes',
'Family Type - HighRes', 'Vacation Habits',
'RR4: Life Stage Type - HighRes']
# Have to drop first dummy so avoid Dummy Variable Trap
# Have to include NaN dummy so zero-vector can't be interpreted ambiguously
# as either NaN or first dropped dummy (which are likely different)
df = pd.get_dummies(
df, prefix_sep='__',
drop_first=True, dummy_na=True,
columns=cat_cols)
def data_mapper(series, mapping_dict):
'''
Reads in a pandas Series object that represents the Generation Designation feature
and returns a re-encoded series according to mapping_dict.
Inputs
------
series: pandas Series of integer codes (1 through 15) representing different
Generation Designation values
mapping_dict: dict of form {designation_code: new_code} used to determine what
values to return
Returns
-------
pandas Series with the new codes
'''
# Since NaN values aren't always propagated as expected, do a quick check
print(f"There are {series.isnull().sum()} null values in the series "
      f"{series.name} prior to extraction")
out = series.map(mapping_dict, na_action='ignore')
print(f"There are {out.isnull().sum()} null values in the series "
      f"{series.name} after extraction")
return out
# For extracting decade of birth info from Generation Designation feature
decade_code_map = {
1: 0,
2: 0,
3: 1,
4: 1,
5: 2,
6: 2,
7: 2,
8: 3,
9: 3,
10: 4,
11: 4,
12: 4,
13: 4,
14: 5,
15: 5
}
df['Generation Decade'] = \
data_mapper(df['Generation Designation'],
decade_code_map)
# For extracting generational movement from Generation Designation feature
movement_code_map = {
1: 0,
2: 1,
3: 0,
4: 1,
5: 0,
6: 1,
7: 1,
8: 0,
9: 1,
10: 0,
11: 1,
12: 0,
13: 1,
14: 0,
15: 1
}
df['Generation Movement'] = \
data_mapper(df['Generation Designation'],
movement_code_map)
# Drop Generation Designation feature now that we've captured
# its information in the new features
df.drop(columns='Generation Designation', inplace=True)
# Drop the rows with missing int'l codes
df.dropna(subset=["RR4: Life Stage Type - Int\'l Code Mapping"],
inplace=True)
# Cast as int since it seems to keep coming up as object type
df["RR4: Life Stage Type - Int'l Code Mapping"] =\
df["RR4: Life Stage Type - Int'l Code Mapping"].astype(int)
# Extract the tens digit as an int - wealth code
df["RR4: Life Stage Type - Int'l - Wealth"] = \
(df["RR4: Life Stage Type - Int'l Code Mapping"] \
/ 10).astype(int)
# Extract the ones digit - life stage code
df["RR4: Life Stage Type - Int'l - Stage"] = \
df["RR4: Life Stage Type - Int'l Code Mapping"] % 10
# Drop lingering RR4: Life Stage Type - Int'l Code Mapping feature
df.drop(
columns="RR4: Life Stage Type - Int'l Code Mapping", inplace=True)
# Make dummies for Life Stage - HighRes, and Life Stage - LowRes
cat_cols = ['Life Stage - HighRes',
'Life Stage - LowRes']
df = pd.get_dummies(
df, prefix_sep='__',
drop_first=True, dummy_na=True,
columns=cat_cols)
# For extracting rural neighborhood status from Neighborhood Quality
rural_code_map = {
0: np.nan,
1: 0,
2: 0,
3: 0,
4: 0,
5: 0,
7: 1,
8: 2
}
df['Bldg: Rural Type'] = data_mapper(df['Bldg: Neighborhood Quality'],
rural_code_map)
# Make dummies out of the multi-level Rural Type categorical
cat_cols = ['Bldg: Rural Type']
df = pd.get_dummies(df,
prefix_sep='__',
drop_first=True,
dummy_na=True,
columns=cat_cols)
# Exclude rural categories as though they weren't scored
neighborhood_code_map = {
0: 0,
1: 1,
2: 2,
3: 3,
4: 4,
5: 5,
7: 0,
8: 0
}
df['Bldg: Neighborhood Quality'] = \
data_mapper(df['Bldg: Neighborhood Quality'],
neighborhood_code_map)
# For extracting business building dominance and dropping info about rest
biz_code_map = {
1: 0,
2: 0,
3: 0,
4: 0,
5: 1
}
df['PLZ8: Primarily Business Bldgs'] = \
data_mapper(df['PLZ8: Most Common Bldg Type'],
biz_code_map)
# Drop original feature, it's no longer useful
df.drop(columns=['PLZ8: Most Common Bldg Type'],
inplace=True)
# -------------------------------------------------------------------
# Return the cleaned dataframe.
return df
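As a standalone illustration of the column-wise `replace` semantics used inside `clean_data` (hypothetical column names and toy missing-value codes): a mapping of `{column: [codes]}` with an explicit `value` replaces those codes only within the named column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, -1, 3], 'b': [9, 2, 9]})
# -1 is treated as a missing code for column 'a', 9 for column 'b'
out = df.replace(to_replace={'a': [-1], 'b': [9]}, value=np.nan)
assert out['a'].isnull().sum() == 1
assert out['b'].isnull().sum() == 2
```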
# Test out our new cleaning function
# Load in the general demographics data.
gen_pop = pd.read_csv('data/Udacity_AZDIAS_Subset.csv', sep=';')
# Load in the feature summary file and the new column name mapping
feat_info = pd.read_csv('data/AZDIAS_Feature_Summary.csv', sep=';')
feat_info['attribute_renamed'] = pd.read_csv('col_renaming.csv', header=None)[1]
gen_pop = clean_data(gen_pop, feat_info, missing_values_breakpoint)
gen_pop.info()
Before we apply dimensionality reduction techniques to the data, we need to perform feature scaling so that the principal component vectors are not influenced by the natural differences in scale for features. Starting from this part of the project, you'll want to keep an eye on the API reference page for sklearn to help you navigate to all of the classes and functions that you'll need. In this substep, you'll need to check the following:
You can make use of the .fit_transform() method to both fit a procedure to the data and apply the transformation to the data at the same time. Don't forget to keep the fit sklearn objects handy, since you'll be applying them to the customer demographics data towards the end of the project.
Here's a reminder on our current situation with regards to missing values at a feature level:
missing = pd.DataFrame(gen_pop.isnull().sum()).rename(columns = {0: 'total missing'})
missing['percent missing'] = round(missing['total missing'] / len(gen_pop),2)
missing.sort_values('total missing', ascending = False, inplace = True)
missing[missing['total missing'] > 0]
len(missing[missing['total missing'] > 0])
Obviously, the feature that would require the most imputation is Birth Year. While I'm not thrilled about the fact that I'd be imputing this one substantially, my earlier logic on why it needs to be done (it's potentially a very useful age proxy variable, as Age Bin is very low resolution with 15-year bins) still holds.
So I'll keep the Birth Year feature after all.
Let's take a look at the distributions of non-missing values for each of these features to get an idea of what we're working with here.
sns.distplot(gen_pop['Birth Year'].dropna())
gen_pop['Birth Year'].describe()
code_map = {
1: '< 30 years old',
2: '30 - 45 years old',
3: '46 - 60 years old',
4: '> 60 years old'
}
sns.distplot(gen_pop['Age Bin'].dropna())
It looks like the modal Age Bin value (46-60 years old) corresponds roughly to the median Birth Year value of 1967 (given that the final year in the Birth Year feature is 2017, the midpoint of this bin would be about 1964). Additionally, the Birth Year feature is left-skewed, which corresponds reasonably well with Age Bin having its second peak for the oldest group (code 4, 60+ year olds). It seems like using Age Bin as a way to impute Birth Year is a pretty reasonable plan!
sns.distplot(gen_pop['RR1: Neighborhood Type'].dropna())
As this is an ordinal feature, I'll treat these values as though they form a histogram and choose to impute using the median since, as in many other cases, this distribution is skewed.
sns.distplot(gen_pop['RR1: Purchasing Power (bins)'].dropna())
gen_pop['RR1: Purchasing Power (bins)'].describe()
I know this will surprise the attentive reader but...I'm using the median again!
code_map = {
1: 'most likely',
2: 'very likely',
3: 'likely',
4: 'average',
5: 'unlikely',
6: 'very unlikely'
}
sns.distplot(gen_pop['HH: Probability of Children in Residence'].dropna())
gen_pop['HH: Probability of Children in Residence'].describe()
As I can see this feature being correlated with at least a few other features (e.g. Family Types and other regional-level family-oriented features), I hesitate to use a simple imputation method like the mode. Thus I'll impute using the median due to the amount of skew in this distribution.
sns.distplot(gen_pop['Health Type'].dropna())
gen_pop['Health Type'].describe()
At this point, I'll assume that I'm using the median to impute unless I see something surprising.
fig, ((ax1, ax2, ax3), (ax4, ax5, ax6), (ax7, ax8, ax9)) = plt.subplots(nrows=3, ncols=3)
#fig.suptitle("Distributions of Values for Features with 2-3% Missing Values")
feats = ['RR3: Bins of 1-2 Family Homes',
'RR3: Bins of 3-5 Family Homes',
'RR3: Bins of 6-10 Family Homes',
'RR3: Bins of 10+ Family Homes',
'RR3: Bin Counts of Bldgs',
'RR1: Movement Patterns',
'Generation Movement',
'Generation Decade']
# Plot it up!
for i, ax in enumerate((ax1, ax2, ax3, ax4, ax5, ax6, ax7, ax8)):
sns.distplot(gen_pop[feats[i]].dropna(), ax=ax)
plt.tight_layout()
OK, I don't see anything here that would disqualify the median as a reasonable imputation approach for these features. Let's do it.
Let's now see what the dataset would look like if we dropped every row that's missing a value in any of the features with at most 1% missing values overall. If this doesn't result in an egregious reduction in the dataset size, we'll just go ahead with that approach.
cols = list(missing[(missing['percent missing'] <= 0.01) &\
(missing['total missing'] > 0)].index)
cols
len(gen_pop.dropna(subset=cols)) / len(gen_pop)
That looks reasonable! We only lose about 2.6% of our rows by doing this, so we should be good.
| Feature Name | % of Rows Missing Values | Imputation Approach |
|---|---|---|
| Birth Year | 39% | Since Age Bin is missing a relatively small amount of values, I'll use the midpoint of the bin for a given row as the Birth Year after dropping the missing rows for Age Bin |
| RR1: Neighborhood Type | 7% | Impute using median |
| RR1: Purchasing Power (bins) | 7% | Impute using median |
| HH: Probability of Children in Residence | 7% | Impute using median |
| Health Type | 4% | Impute using median |
| RR3: Bins of 1-2 Family Homes | 3% | Impute using median |
| RR3: Bins of 3-5 Family Homes | 3% | Impute using median |
| RR3: Bins of 6-10 Family Homes | 3% | Impute using median |
| RR3: Bins of 10+ Family Homes | 3% | Impute using median |
| RR3: Bin Counts of Bldgs | 3% | Impute using median |
| RR1: Movement Patterns | 3% | Impute using median |
| Generation Movement | 3% | Impute using median |
| Generation Decade | 3% | Impute using median |
| Bldg: Number of HHs | 1% | DROP |
| PLZ8: Primarily Business Bldgs | 1% | DROP |
| PLZ8: Bins of 1-2 Family Homes | 1% | DROP |
| PLZ8: Bins of 6-10 Family Homes | 1% | DROP |
| PLZ8: Bins of 10+ Family Homes | 1% | DROP |
| PLZ8: Bin Counts of HHs | 1% | DROP |
| PLZ8: Bin Counts of Bldgs | 1% | DROP |
| PLZ8: Bins of 3-5 Family Homes | 1% | DROP |
| PLZ8: Number of Cars | 1% | DROP |
| Community: Share of Unemployment | <1% | DROP |
| Community: Share of Unemployment Relative to Parent County | <1% | DROP |
| Community: Size (bins) | <1% | DROP |
| Bldg: Number of Academic Title Holders | <1% | DROP |
| Age Bin | <1% | DROP |
| PostCode: Distance to City Center (bins) | <1% | DROP |
| PostCode: Density of HHs per km^2 (bins) | <1% | DROP |
| PostCode: Distance to Nearest Urban Center (bins) | <1% | DROP |
| Bldg: Distance to Point of Sale Category | <1% | DROP |
| RR1: Residential-Commercial Activity Ratio (categories) | <1% | DROP |
# If you've not yet cleaned the dataset of all NaN values, then investigate and
# do that now.
# Drop rows with 1% or fewer missing values
gen_pop.dropna(subset=cols, inplace=True)
# Use Age Bin to impute Birth Year values, using midpoint of associated Age Bin to dictate year
age_bin_map = {
1: '< 30 years old',
2: '30 - 45 years old',
3: '46 - 60 years old',
4: '> 60 years old'
}
# Assume data are all relative to 2017, as this is the maximum Birth Year value
# Earliest Birth Year is 1900, making the greatest age 117
# For code 4 (>60), take the midpoint of ages 60-100, i.e. 80 (treating anyone
# 100+ as an outlier), which gives a birth year of 1937
age_bin_to_year = {
1: 2002,
2: 1979,
3: 1963,
4: 1937
}
# Make a series that is a mapping of Age Bin to birth years
mapped_series = gen_pop['Age Bin'].map(age_bin_to_year)
# Use this mapped series and fillna() on Birth Year
# to take midpoint Age Bin year as Birth Year
gen_pop['Birth Year'].fillna(mapped_series, inplace=True)
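The key behavior here is that `fillna` with a Series argument aligns on index and only fills the missing slots. A toy sketch (the year values are illustrative bin midpoints):

```python
import numpy as np
import pandas as pd

birth = pd.Series([1960.0, np.nan, np.nan])
from_bins = pd.Series([1937, 1979, 2002])  # hypothetical bin-midpoint years
filled = birth.fillna(from_bins)
# The present value at index 0 is untouched; the two NaNs are filled
assert filled.tolist() == [1960.0, 1979.0, 2002.0]
```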
# Recompute missing values
missing = pd.DataFrame(gen_pop.isnull().sum()).rename(columns = {0: 'total missing'})
missing['percent missing'] = round(missing['total missing'] / len(gen_pop),2)
missing.sort_values('total missing', ascending = False, inplace = True)
cols = list(missing[missing['total missing'] > 0].index)
missing[missing['total missing'] > 0]
# Impute using the median for all remaining features
from sklearn.impute import SimpleImputer
imp_median = SimpleImputer(strategy='median')
gen_pop[cols] = \
imp_median.fit_transform(gen_pop[cols])
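For reference, the median imputation above is equivalent (for a single column) to this pandas-only sketch with toy values:

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 3.0, np.nan, 7.0])
x = x.fillna(x.median())  # median of the non-missing values [1, 3, 7] is 3
assert x.isnull().sum() == 0
assert x.tolist() == [1.0, 3.0, 3.0, 7.0]
```

The advantage of SimpleImputer over doing this by hand is that the fitted medians are stored on the object, so the same values can later be applied to the customer data.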
# Recompute missing values
missing = pd.DataFrame(gen_pop.isnull().sum()).rename(columns = {0: 'total missing'})
missing['percent missing'] = round(missing['total missing'] / len(gen_pop),2)
missing.sort_values('total missing', ascending = False, inplace = True)
missing[missing['total missing'] > 0]
There you go! No more missing values!
# Apply feature scaling to the general population demographics data.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
gen_pop_scaled = scaler.fit_transform(gen_pop)
scaler.n_samples_seen_
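What StandardScaler computes is just a per-column z-score; here's a minimal numpy check of the "zero mean, unit variance" claim on toy data:

```python
import numpy as np

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
# (X - column mean) / column std, using the population std as sklearn does
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
assert np.allclose(Xs.mean(axis=0), 0.0)
assert np.allclose(Xs.std(axis=0), 1.0)
```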
| Feature Name | % of Rows Missing Values | Imputation Approach |
|---|---|---|
| Birth Year | 37% | Since Age Bin isn't missing any values, I'll use the midpoint of the bin for a given row as the Birth Year |
| HH: Probability of Children in Residence | 1% | Drop the rows wherein this is missing |
| Bldg: Number of HHs | < 1% | Impute using the median |
| PLZ8: Number of Cars | < 1% | Impute using the median |
| Bldg: Distance to Point of Sale Category | < 1% | Impute using mode |
I found that there were a number of features with nonzero amounts of missing values remaining (32 in total). Of these, only Birth Year had more than 7% missing values relative to the total number of data rows (about 38%).
I utilized three techniques to impute values for these features:
- Using the midpoint of the Age Bin value, which had relatively few missing values, to calculate likely values for Birth Year
- Dropping the rows missing values in any feature with at most 1% missing overall
- Imputing with the median for all remaining features
I've copied my imputation plan in a table above this cell as a reminder of what was specifically done for each feature in question.
After completing imputation and having a dataset with no missing values as a result, I standardized all features so that they would have zero mean and a variance of 1.
On your scaled data, you are now ready to apply dimensionality reduction techniques.
You can visualize the explained variability with the plot() function. Based on what you find, select a value for the number of transformed features you'll retain for the clustering part of the project.
# Apply PCA to the data.
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(gen_pop_scaled)
# Investigate the variance accounted for by each principal component.
exp_var = pd.DataFrame(pca.explained_variance_ratio_, columns=['Explained Variance Ratio'])
exp_var.index.name = 'Principal Component'
exp_var['CumSum of Explained Variance Ratio'] = exp_var['Explained Variance Ratio'].cumsum()
exp_var
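Under the hood, the explained variance ratios are just the squared singular values of the centered data, normalized to sum to 1. A quick numpy sketch on random toy data (not the project data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X = X - X.mean(axis=0)  # center the columns, as PCA does internally
_, s, _ = np.linalg.svd(X, full_matrices=False)
ratios = s**2 / np.sum(s**2)         # explained variance ratios
assert np.isclose(ratios.sum(), 1.0)
assert np.all(np.diff(ratios) <= 0)  # sorted in descending order
```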
# Plot explained variance ratio as a function of PC
# with lineplot of cumsum on a separate y-axis
fig, ax1 = plt.subplots()
# Explained variance ratio sequence
sns.lineplot(x=exp_var.index, y=exp_var['Explained Variance Ratio'], ax=ax1)
ax1.set_xlabel('Principal Component Number')
ax1.set(ylabel='Explained Variance Ratio')
# Make them share an x-axis but have decoupled y-axes
ax2 = ax1.twinx()
# Cumulative explained variance ratio sequence
sns.lineplot(x=exp_var.index, y=exp_var['CumSum of Explained Variance Ratio'],
color='red', ax=ax2)
ax2.set(ylabel='Cumulative Explained Variance Ratio')
Let's zoom in on that elbow in the cumulative plot...
exp_var.loc[125:175]
It looks like we get a cumulative explained variance ratio of 90% with 128 principal components (PCs), 95% with 145 principal components, and 99% with 176 principal components. There's clearly plenty of room here to do dimensionality reduction!
As it looks like the inflection point/elbow of the cumulative explained variance curve is roughly at 140 PCs, let's retain 145 of the PCs to get a nice round(ish) 95% of the explained variance captured, meaning we're only using roughly 57% of the total feature set now. That's a pretty good reduction!
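The "smallest number of components reaching a variance target" choice can also be made programmatically; a sketch with toy explained variance ratios (not the real ones):

```python
import numpy as np

ratios = np.array([0.4, 0.3, 0.15, 0.1, 0.05])  # toy explained variance ratios
cum = np.cumsum(ratios)
# Index of the first cumulative value >= 0.95, plus 1 to get a component count
k = int(np.searchsorted(cum, 0.95)) + 1
assert k == 4
```

Note that sklearn's PCA also accepts a float between 0 and 1 for n_components, in which case it selects the number of components needed to explain at least that fraction of the variance.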
# Re-apply PCA to the data with 145 components retained
pca = PCA(n_components=145)
gen_pop_scaled_pca = pca.fit_transform(gen_pop_scaled)
I found that the cumulative amount of variance explained by the principal components slowed down around 140 components, so I ultimately decided to use this inflection point as a guide and selected 145 components as my target for dimensionality reduction, as it managed to explain 95% of the variance of the dataset with 57% of the feature count.
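As an aside, scikit-learn can automate this kind of threshold choice: passing a float between 0 and 1 to `n_components` keeps just enough components to reach that fraction of explained variance. A minimal sketch on synthetic correlated data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Correlated synthetic features so PCA can compress them
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))

pca = PCA(n_components=0.95)  # keep enough PCs for >= 95% of the variance
X_reduced = pca.fit_transform(X)

cum = np.cumsum(pca.explained_variance_ratio_)
print(pca.n_components_, round(cum[-1], 3))
```

Inspecting the curve manually, as done above, has the advantage of also revealing where the elbow sits, rather than just hitting a variance target.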
Now that we have our transformed principal components, it's a nice idea to check out the weight of each variable on the first few components to see if they can be interpreted in some fashion.
As a reminder, each principal component is a unit vector that points in the direction of highest variance (after accounting for the variance captured by earlier principal components). The further a weight is from zero, the more the principal component points in the direction of the corresponding feature. If two features have large weights of the same sign (both positive or both negative), then increases in one can be expected to be associated with increases in the other. In contrast, features with weights of opposite signs can be expected to show a negative correlation: increases in one variable should be associated with decreases in the other.
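This same-sign intuition is easy to verify on a toy dataset: two strongly positively correlated features should receive same-sign weights on the first principal component.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
base = rng.normal(size=500)
# Two positively correlated features: expect same-sign weights on PC 0
X = np.column_stack([base + 0.1 * rng.normal(size=500),
                     base + 0.1 * rng.normal(size=500)])

pca = PCA(n_components=1).fit(X)
w = pca.components_[0]
assert w[0] * w[1] > 0  # same sign -> the features rise and fall together
```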
# Map weights for the first principal component to corresponding feature names
# and then print the linked values, sorted by weight.
# HINT: Try defining a function here or in a new cell that you can reuse in the
# other cells.
def get_pca_feature_weights(pca, PC, plot=False):
    '''
    Takes a fitted PCA object and a principal component number and
    returns the weights of the original named features that comprise
    that principal component

    Inputs
    ------
    pca: fitted sklearn PCA object
    PC: int. Index of the principal component (PC) for which you want
        weights by named original feature
    plot: bool. If True, plots out distribution of weights with vertical line
        at x=0 and barplot of weights by feature

    Outputs
    -------
    pandas Series with original feature names as index and weights as values
    '''
    pca_weights = pd.DataFrame(data=pca.components_, columns=gen_pop.columns)
    pca_weights.index.name = 'Principal Components'
    out = pca_weights.loc[PC].sort_values(ascending=False)
    out.name = f"Principal Component {PC} Weights"
    if plot:
        fig, (ax_dist, ax_bar) = plt.subplots(nrows=2)
        # Distribution of weights
        sns.distplot(out, ax=ax_dist)
        # Draw vertical line at x=0 to see how centered the distribution is
        ax_dist.axvline(x=0, color='red')
        # Barplot of weights by feature
        sns.barplot(x=out.index, y=out.values, ax=ax_bar)
        # Draw vertical line at midpoint of distribution
        ax_bar.axvline(x=len(pca_weights.columns) / 2, color='red')
        ax_bar.set(ylabel='Component Weight', xlabel='Feature Name')
        plt.tight_layout()
    return out
# Weights for the first principal component (PC)
get_pca_feature_weights(pca, 0, plot=True)
# Map weights for the second principal component to corresponding feature names
# and then print the linked values, sorted by weight.
get_pca_feature_weights(pca, 1, plot=True)
# Map weights for the third principal component to corresponding feature names
# and then print the linked values, sorted by weight.
get_pca_feature_weights(pca, 2, plot=True)
A few things I've noticed in these results:

Principal Component 0: The largest positive weights belong to community-level features (PLZ8: Bins of 6-10 Family Homes, HH: Net Income Bins, RR4: Life Stage Type - Int'l - Wealth, and PLZ8: Bins of 10+ Family Homes). Additionally, the 5th and 6th features by weight are also community-level. The largest negative weights belong to MoneyType__Minimalist, which is admittedly money-related (although at the person level); RR1: Movement Patterns, which likely relates to the amount of mobility in terms of where people in that region live or work; RR3: Bins of 1-2 Family Homes; PLZ8: Bins of 1-2 Family Homes; and RR3: Bin Counts of Bldgs. This suggests to me that the size and building-oriented nature of communities may be slightly more important than community wealth measures in this PC.

Principal Component 1: The largest positive weights belong to Age Bin, MoneyType__Preparer, and Energy Consumption Type__3.0, which makes this one tricky to interpret, as a preparer mentality for finances, a "fair supplied" attitude towards energy consumption (whatever that means), and a person's age don't seem to have any obvious overlap in terms of meaning. If we continue down the list of weights, we find PersonalityType__Event, PersonalityType__Sensual, Life Stage - LowRes__2.0, and Customer Type as the next highest positive weights. There are some indications that age may be relevant in this PC (given the presence of Age Bin and the fact that Life Stage - LowRes__2.0 includes age profiles among other things) but likely consumer behavior is more important to this PC, as MoneyType__Preparer, Energy Consumption Type__3.0, PersonalityType__Event, and Customer Type each measure in some way general or product-specific consumption patterns. This is a pretty broad interpretation, so let's see if we can narrow it down by looking at the large negative weights too. At first the large negative weights on Birth Year and Generation Decade seemed at odds with the fact that Age Bin had a high positive weight. But then I took another look at the data dictionary and my own engineered codes for Generation Decade and realized that these weights were consistent with the concept of giving older individuals a higher value in this PC (e.g. for Age Bin, higher values indicate older individuals, whereas for Birth Year and Generation Decade, higher values indicate younger individuals). Additionally, Life Stage - LowRes__2.0 (a high positive weight feature) corresponds to "single low-income and average earners of higher age".

Principal Component 2: The largest positive weights belong to PersonalityType__Dreamer, PersonalityType__Family, PersonalityType__Social, and PersonalityType__Cultured. As low values for these features represent high affinities for these personality types, it seems that this PC actually penalizes individuals that have strong affinities to these personality types. Since this is less informative than it could be, let's take a look at the large negative values before trying to even begin an interpretation. The largest negative weights belong to Gender, PersonalityType__Event, PersonalityType__Critical, PersonalityType__Dominant, and PersonalityType__Combative. Again, as these features all have low scores for high affinities, this suggests that these personality types are actually dominant features in this PC and, since Gender has a value of 1 for men and 2 for women, it suggests that a strong preference in the PC is given to men. Gender has the largest weight magnitude in this PC, so it's probably not crazy to say that the short description of this PC is gender-based (with stereotypical personality types for the two genders).

You've assessed and cleaned the demographics data, then scaled and transformed them. Now, it's time to see how the data clusters in the principal components space. In this substep, you will apply k-means clustering to the dataset and use the average within-cluster distances from each point to their assigned cluster's centroid to decide on a number of clusters to keep.
The .score() method might be useful here, but note that in sklearn, scores tend to be defined so that larger is better. Try applying it to a small, toy dataset, or use an internet search to help your understanding.
# Over a number of different cluster counts...
from sklearn.cluster import KMeans
# Max allowable cluster count
limit = 30
ks = []
scores = []
# Progress bar to see where we're at
from tqdm import tqdm
for k in tqdm(range(2, limit)):
    # Try to speed this up a bit by using all available cores
    clusterer = KMeans(n_clusters=k, n_jobs=-1)
    clusterer.fit(gen_pop_scaled_pca)
    score_avg = clusterer.score(gen_pop_scaled_pca)
    ks.append(k)
    scores.append(score_avg)
    print(f"For {k} clusters, the within-cluster score is {score_avg}")
k_df = pd.DataFrame({'k': ks,
'Within-Cluster Score': scores
})
# Plot the scores at different values of k
fig, ax = plt.subplots()
sns.lineplot('k', 'Within-Cluster Score', data = k_df,
ax=ax)
ax.set(xlabel="Number of Clusters")
The elbow method is a pretty subjective evaluative technique as it is typically used, so let's see if we can be a bit more numeric about it. Effectively, it is a matter of derivatives. The first derivative is the slope of these curves, $\frac{dScore}{dk}$. To look for the elbow in the score vs. k curve wherein we stop gaining as much information per increased cluster count, we can take the first derivative and look for where we transition from a high slope region to a low slope region.
So, let's look at the first derivative (or a simple approximation of it), shall we?
# Just doing delta-score/delta-k as the derivative
# Assuming delta-k = 1
k_df['Within-Cluster Derivative / dk'] = \
k_df.diff(periods=1)['Within-Cluster Score']
fig, ax = plt.subplots()
sns.lineplot('k', 'Within-Cluster Derivative / dk', data = k_df,
ax=ax)
ax.set(xlabel="Number of Clusters")
ax.axhline(color='red')
plt.tight_layout()
Looks like $k<10$ is probably the region of interest. Let's zoom in.
# Zoom-in
x_zoom = (0,10)
fig, ax = plt.subplots()
sns.lineplot('k', 'Within-Cluster Derivative / dk', data = k_df,
ax=ax)
ax.set(xlabel="Number of Clusters", xlim=x_zoom)
ax.axhline(color='red')
plt.tight_layout()
It looks like our optimal value of k that minimizes the cluster count while maximizing the information we gain per cluster is $k=6$. It's the point at which we see the slope of the within-cluster score starting to flatten out, indicating that we are no longer accelerating in the amount of within-cluster variance being explained by increasing counts of clusters. We seem to have found our elbow!
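The flattening-slope argument can also be made fully numeric with a second difference: the elbow is roughly where the gain per added cluster drops off the fastest. A toy sketch (the scores below are made up for illustration; on the real data, the visual check above still matters):

```python
import numpy as np

# Toy within-cluster scores (sklearn's KMeans.score is negative inertia,
# so larger is better): steep improvement early, then a flat tail
ks = np.arange(2, 12)
scores = np.array([-100, -60, -40, -30, -26, -24, -23, -22.5, -22.2, -22.0])

gain = np.diff(scores)               # approx. first derivative w.r.t. k
change = np.diff(gain)               # approx. second derivative
elbow = ks[1:-1][np.argmin(change)]  # k where the gain drops off fastest
print(elbow)
```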
# Re-fit the k-means model with the selected number of clusters and obtain
# cluster predictions for the general population demographics data.
clusterer = KMeans(n_clusters=6, n_jobs=-1)
cluster_labels = clusterer.fit_predict(gen_pop_scaled_pca)
# What does the membership of the general population look like for the cluster labels?
# countplot suits discrete labels better than a smoothed distplot
sns.countplot(x=cluster_labels)
I took the derivative of each score with respect to k in order to identify the k value at which we stopped being able to explain the variance of our clusters (in the case of the within-clusters score) at an accelerated rate. I defined this as the point in which the slope of the within-cluster variance score flattened out, indicating a constant rate of variance explaining preceded by an accelerated rate of the same. Based upon this interpretation of the numeric approach to the Elbow Method, a k value of 6 seems most reasonable.
Now that you have clusters and cluster centers for the general population, it's time to see how the customer data maps on to those clusters. Take care to not confuse this for re-fitting all of the models to the customer data. Instead, you're going to use the fits from the general population to clean, transform, and cluster the customer data. In the last step of the project, you will interpret how the general population fits apply to the customer data.
Note that the customer demographics file is semicolon (;) delimited. Clean the data using the clean_data() function you created earlier. (You can assume that the customer demographics data has similar meaning behind missing data patterns as the general demographics data.) Don't use any .fit() or .fit_transform() method to re-fit the old objects, nor should you be creating new sklearn objects! Carry the data through the feature scaling, PCA, and clustering steps, obtaining cluster assignments for all of the data in the customer demographics data.
# Load in the customer demographics data and re-import feature summary data for safety's sake.
customers = pd.read_csv('data/Udacity_CUSTOMERS_Subset.csv', sep=';')
feat_info = pd.read_csv('data/AZDIAS_Feature_Summary.csv', sep=';')
customers.head()
customers.info()
# Clean the customer data
# Use external script built using the previous version of clean_data()
# but that also includes imputation
from clean_data import enhanced_clean
customers = enhanced_clean(customers, feat_info, missing_values_breakpoint, imp_median)
customers.info()
That's weird, this has one fewer feature than the gen_pop dataset. Let's see what's going on here...
# What column is in gen_pop that isn't in customers?
gen_pop.columns[~gen_pop.columns.isin(customers.columns)]
Looks like the Bldg: Building Type feature never takes the value 5 anywhere in the customers dataset, so the corresponding dummy column was never created during one-hot encoding. We can fix this: we'll just create the proper dummy variable/feature and give it a constant value of 0, since no rows have that value.
customers['Bldg: Building Type__5.0'] = 0
# Do we still have a column mismatch?
gen_pop.columns[~gen_pop.columns.isin(customers.columns)]
Fantastic, that did the trick!
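A more general guard for this situation (a sketch on hypothetical frames, not the project code) is to reindex the customer frame against the general-population columns, which both fills absent dummy columns with 0 and forces the column order to match; that ordering matters because, at least in older scikit-learn versions, a fitted scaler/PCA transforms columns positionally rather than by name.

```python
import pandas as pd

train = pd.DataFrame({'a': [1, 2], 'b__5.0': [0, 1]})   # hypothetical dummies
test = pd.DataFrame({'a': [3, 4]})                      # dummy column missing

# Align test columns to the training layout: absent dummies become 0,
# and column order matches what the fitted scaler/PCA expect
test_aligned = test.reindex(columns=train.columns, fill_value=0)
print(list(test_aligned.columns))
```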
# Let's see if the assumption of missing values being equivalent across
# datasets is valid...
missing = pd.DataFrame(customers.isnull().sum()).rename(columns = {0: 'total missing'})
missing['percent missing'] = round(missing['total missing'] / len(customers),2)
missing.sort_values('total missing', ascending = False, inplace = True)
missing
print(f"There are {customers.isnull().sum().sum()} missing values remaining in the dataset.")
Fantastic, no missing values after cleaning and imputation!
# Apply feature scaling to the customer data
customers_scaled = scaler.transform(customers)
customers_scaled.shape
# Do dimensionality reduction (PCA) on the customer data
customers_scaled_pca = pca.transform(customers_scaled)
customers_scaled_pca.shape
# Label the clusters in the customer data
cluster_labels_customers = clusterer.predict(customers_scaled_pca)
# countplot suits discrete labels better than a smoothed distplot
sns.countplot(x=cluster_labels_customers)
At this point, you have clustered data based on demographics of the general population of Germany, and seen how the customer data for a mail-order sales company maps onto those demographic clusters. In this final substep, you will compare the two cluster distributions to see where the strongest customer base for the company is.
Consider the proportion of persons in each cluster for the general population, and the proportions for the customers. If we think the company's customer base to be universal, then the cluster assignment proportions should be fairly similar between the two. If there are only particular segments of the population that are interested in the company's products, then we should see a mismatch from one to the other. If there is a higher proportion of persons in a cluster for the customer data compared to the general population (e.g. 5% of persons are assigned to a cluster for the general population, but 15% of the customer data is closest to that cluster's centroid) then that suggests the people in that cluster to be a target audience for the company. On the other hand, the proportion of the data in a cluster being larger in the general population than the customer data (e.g. only 2% of customers closest to a population centroid that captures 6% of the data) suggests that group of persons to be outside of the target demographics.
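This over/under-representation comparison reduces to a per-cluster ratio of proportions. A minimal sketch with hypothetical labels (a ratio above 1 marks a target-audience cluster; below 1, a cluster outside the target demographics):

```python
import pandas as pd

# Hypothetical cluster labels for each population
gen = pd.Series([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
cust = pd.Series([0, 2, 2, 2, 2, 2, 2, 2, 2, 2])

gen_pct = gen.value_counts(normalize=True).sort_index()
cust_pct = cust.value_counts(normalize=True).sort_index()

# Ratio > 1: overrepresented among customers; < 1: underrepresented
ratio = (cust_pct / gen_pct).fillna(0)
print(ratio)
```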
Take a look at the following points in this step:
Seaborn's countplot() or barplot() function could be handy. To interpret a cluster in terms of the original features, you can use the .inverse_transform() method of the PCA and StandardScaler objects to transform centroids back to the original data space and interpret the retrieved values directly.
# Put cluster labels into PCA dataframes
# so the principal component interpretations can be leveraged
# Make scaled arrays DataFrames for exploration
# Assuming here that the features are ordered
# from greatest (0) to least (144) explained variance
col_names = [f"PC{i}" for i in range(0,gen_pop_scaled_pca.shape[1])]
gen_pop_scaled_pca = pd.DataFrame(gen_pop_scaled_pca, columns=col_names)
customers_scaled_pca = pd.DataFrame(customers_scaled_pca, columns=col_names)
# Add in cluster labels
gen_pop_scaled_pca['Cluster Label'] = cluster_labels
customers_scaled_pca['Cluster Label'] = cluster_labels_customers
customers_scaled_pca.info()
# Create DataFrames that retain deleted rows for analysis
# Note that, for simplicity's sake, this is just the raw data with no cleaning or imputation
gen_pop_raw = pd.read_csv('data/Udacity_AZDIAS_Subset.csv', sep=';')
customers_raw = pd.read_csv('data/Udacity_CUSTOMERS_Subset.csv', sep=';')
gen_pop_raw.info()
# What rows are missing in one but not the other?
dropped_rows_cust = customers_raw[~customers_raw.index.isin(customers.index)].index
retained_rows_cust = customers_raw[customers_raw.index.isin(customers.index)].index
dropped_rows_gen = gen_pop_raw[~gen_pop_raw.index.isin(gen_pop.index)].index
retained_rows_gen = gen_pop_raw[gen_pop_raw.index.isin(gen_pop.index)].index
# Set indices for newly-created PCA-derived DataFrames so they match original DataFrame indices
gen_pop_scaled_pca.index = retained_rows_gen
customers_scaled_pca.index = retained_rows_cust
# Set rest of labels accordingly
gen_pop_raw = gen_pop_raw.join(gen_pop_scaled_pca['Cluster Label'],
how='left')
customers_raw = customers_raw.join(customers_scaled_pca['Cluster Label'],
how='left')
# Set cluster labels for rows that were deleted to 6 (max label + 1)
gen_pop_raw.loc[dropped_rows_gen, 'Cluster Label'] = 6
customers_raw.loc[dropped_rows_cust, 'Cluster Label'] = 6
# Check to make sure we haven't generated a bunch of null values as a result of JOINs
temp = gen_pop_raw['Cluster Label'].isnull().sum()
print(f"Gen_pop label nulls = {temp}")
temp = customers_raw['Cluster Label'].isnull().sum()
print(f"Customer label nulls = {temp}")
There we go! The cluster labels appear to be pushed into DataFrames I can explore now.
# Compare the proportion of data in each cluster for the customer data to the
# proportion of data in each cluster for the general population.
fig, (ax_general, ax_customer) = plt.subplots(nrows=2, sharex=True)
sns.countplot(x=gen_pop_raw['Cluster Label'], ax=ax_general)
ax_general.set(title="General Population Clusters",
xlabel="", ylabel='Count')
sns.countplot(x=customers_raw['Cluster Label'], ax=ax_customer)
ax_customer.set(title="Customer Clusters",
xlabel='Cluster Label', ylabel='Count')
plt.tight_layout()
Well it's certainly clear from this simple comparison that the two populations (general population and customers) are not equivalent in terms of cluster results. Let's dive a little more deeply into this and directly compare the percentages of each population present in each cluster.
Note: cluster 6 corresponds to the "fake" cluster comprised of individuals that were dropped in the preprocessing due to them having more than one missing value. Since these were not qualitatively similar to individuals with low missing value counts in every feature we spot-checked, we're treating them as a unique cluster unto themselves. This isn't a rigorous way of defining that cluster label, admittedly, but it's better than ignoring them completely!
# What does the % of a population's membership in each cluster look like
# across both populations?
# Customer data
cluster_counts_cust = pd.DataFrame(customers_raw['Cluster Label'].value_counts()).reset_index()
cluster_counts_cust.rename(columns={'index': 'Cluster', 'Cluster Label': 'Count'}, inplace=True)
cluster_counts_cust.sort_values('Cluster', inplace=True)
# General population data
cluster_counts_gen = pd.DataFrame(gen_pop_raw['Cluster Label'].value_counts()).reset_index()
cluster_counts_gen.rename(columns={'index': 'Cluster', 'Cluster Label': 'Count'}, inplace=True)
cluster_counts_gen.sort_values('Cluster', inplace=True)
# Add in % values too
cluster_counts_gen['Percentage'] = cluster_counts_gen['Count'] / cluster_counts_gen['Count'].sum()
cluster_counts_cust['Percentage'] = cluster_counts_cust['Count'] / cluster_counts_cust['Count'].sum()
# What does the % of a population's membership in each cluster look like
# across both populations?
# Visualize it.
fig, ax = plt.subplots()
sns.lineplot(x='Cluster', y='Percentage', data=cluster_counts_gen,
ax=ax, label='General Population')
sns.lineplot(x='Cluster', y='Percentage',
data=cluster_counts_cust, ax=ax, label='Customers')
ax.legend()
As we're most interested in why segments of the population are particularly attracted to or repelled by the company's products, we'll focus on the overrepresented and underrepresented clusters for the remainder of this analysis.
# What kinds of people are part of a cluster that is overrepresented in the
# customer data compared to the general population?
# Cluster 2
# Since these clusters were based off of PCA components and not the original untransformed
# features (except for cluster 6), I'll focus on the PCA components for the most part
# Let's first look at the first 3 principal components and how our clusters look in that space
# as I've already provided interpretations of those components earlier
fig, (ax1_0, ax2_0, ax2_1) = plt.subplots(nrows=3, figsize = (6,10))
sns.scatterplot(data=customers_scaled_pca, x='PC0', y='PC1', hue='Cluster Label',
ax=ax1_0, palette='rainbow_r', alpha = 0.5, legend=False)
ax1_0.set(xlim=(-10,10), ylim=(-10,10))
sns.scatterplot(data=customers_scaled_pca, x='PC0', y='PC2', hue='Cluster Label',
ax=ax2_0, palette='rainbow_r', alpha = 0.5, legend='full')
ax2_0.set(xlim=(-10,10), ylim=(-10,10))
sns.scatterplot(data=customers_scaled_pca, x='PC1', y='PC2', hue='Cluster Label',
ax=ax2_1, palette='rainbow_r', alpha = 0.5, legend=False)
ax2_1.set(xlim=(-10,10), ylim=(-10,10))
# Put the legend outside the figure
# Adapted from https://stackoverflow.com/questions/30490740/move-legend-outside-figure-in-seaborn-tsplot
ax2_0.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.tight_layout()
plt.savefig('data/PC0_1_2.png')
# Let's take a look at this in 3D
px.scatter_3d(customers_scaled_pca, x='PC0', y='PC1', z='PC2', color='Cluster Label',
opacity=0.5)
# Use the cluster centroids for each label as a proxy
# for what the most important principal components are for each cluster
cluster_centers = pd.DataFrame(clusterer.cluster_centers_)
cluster_centers.index.name = 'Cluster Label'
cluster_centers.columns = col_names
cluster_centers
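Before reading the centroids in PC space, it's worth noting the inverse route the hint mentions: chaining `.inverse_transform()` calls maps centroids back to the original feature units. A self-contained sketch on synthetic data (the array and parameter choices are stand-ins, not the project's actual fits):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.normal(loc=5.0, scale=2.0, size=(300, 8))  # stand-in demographics

scaler = StandardScaler().fit(X)
pca = PCA(n_components=4).fit(scaler.transform(X))
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(
    pca.transform(scaler.transform(X)))

# Undo PCA, then undo scaling, to read centroids in original feature units
centroids_orig = scaler.inverse_transform(
    pca.inverse_transform(km.cluster_centers_))
print(centroids_orig.shape)  # one row per cluster, one column per feature
```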
def top_cluster_coords(df, cluster_label, top_k):
'''
Takes information about the cluster centroid coordinates for all cluster labels
and returns the top_k positive and negative coordinate values for a given cluster's
centroid. These principle components can then be interpreted in human-understandable
terms to figure out what a reasonable description of the membership of a cluster is.
Inputs
------
df: pandas DataFrame with cluster labels as the index values and
the centroid coordinate for each feature as columns
cluster_label: int. Label of the cluster you want to investigate
top_k: int. Number of most positive and most negative coordinates you want returned
Returns
-------
pandas DataFrame with a single column for the coordinates of the top positive
and negative principal components (PCs)
'''
# What are the top k most positive and top k most negative PCs according to the centroid?
top = pd.DataFrame(pd.concat([df.loc[cluster_label].sort_values(ascending=False)[:top_k],
df.loc[cluster_label].sort_values(ascending=False)[-top_k:]]))
top.rename(columns={0: 'Top PCs', 1:'Top PCs'}, inplace=True)
return top
# What kinds of people are part of a cluster that is overrepresented in the
# customer data compared to the general population?
# Cluster 2
top_PCs_c2 = top_cluster_coords(cluster_centers, 2, 3)
top_PCs_c2
# What kinds of people are part of a cluster that is underrepresented in the
# customer data compared to the general population?
# Clusters 0, 3, and 4
top_PCs_c0 = top_cluster_coords(cluster_centers, 0, 3)
top_PCs_c3 = top_cluster_coords(cluster_centers, 3, 3)
top_PCs_c4 = top_cluster_coords(cluster_centers, 4, 3)
# Cluster 0
top_PCs_c0
# Cluster 3
top_PCs_c3
# Cluster 4
top_PCs_c4
get_pca_feature_weights(pca, 2)
Further PC Interpretations to Aid in Cluster Interpretation:
First of all: cluster 6.
- Age Bin distributions seem to indicate that cluster 6 is likely younger on average than the other clusters.
- Customer Type distributions are roughly similar between cluster 6 and the other clusters, but cluster 6 is more heavily "normal returner" type than the rest of the clusters (which are more dominated by the minimal returner, incentive-receptive customer type). As this customer type is pretty heavily represented in both groups, it's fair to say that cluster 6 is more biased towards this "normal returner" type than anything else.
- Vacation Habits is roughly uniformly distributed by type across the other clusters in aggregate.

Some thoughts based upon assessment of the visualization of the first three principal components:
As a reminder, the first three principal components (PCs) can be described in human-understandable terms as such, based upon my earlier interpretations:
Cluster 2 - company's products are popular
Of course, it's not reasonable to assume that these first three PCs are going to be the dominant features of every cluster, just because they explain the most variance for the aggregate dataset. So to make sure we truly capture the most important PCs to each cluster of interest, we'll take a semi-supervised approach to this problem: we'll consider the cluster label to be our target variable in a supervised learning sense, and use techniques from that branch of machine learning to determine the most important features/PCs for each cluster we care about.
One approach that would work for this is to use a supervised model (of any sort, really) to predict the cluster labels and then use a post hoc explanation technique (e.g. LIME) to explain what features contributed to each predicted label. However, this would involve setting up a pipeline, likely doing cross-validation, scoring, etc., and would be a bit of an undertaking. As this would be a multi-class classification task, I can't even easily utilize techniques like SelectKBest, unfortunately. As such, I'm going to go with the simpler, albeit less comprehensive, approach of simply inspecting the cluster centroid coordinates.
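For reference, the supervised-model idea can be kept fairly lightweight without the full LIME pipeline: fit any classifier on (features, cluster labels) and read off its feature importances. A toy sketch (RandomForest chosen here purely for convenience; the data and the planted signal on feature 0 are synthetic):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 5))
X[:150, 0] += 4  # feature 0 cleanly separates two groups

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Treat the cluster labels as a supervised target and rank the features
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, labels)
top_feature = int(np.argmax(clf.feature_importances_))
print(top_feature)
```

This would rank features per clustering overall rather than per cluster, which is part of why the centroid inspection below remains the simpler choice.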
Some thoughts from the cluster centroid analysis:
Based upon my assessment of the cluster centroids for the groups over- and underrepresented in the customer dataset relative to the general population, I expect that this company's products would best be targeted at renters with high incomes that care about the environment. The company would likely benefit from using advertising that focuses on environmental and/or personal benefits, but not try to use messages focused on community/societal gains.
There are also certain portions/segments of the population that are particularly not in favor of this company's products. Specifically, this includes folks who are not motivated by environmental issues and tend to be swayed by community values arguments, essentially the opposites of the high affinity group.